Orchestra Festival Evaluations: Interjudge Agreement And 
Relationships Between Performance Categories And Final Ratings 

By Barry R. Garman, J. David Boyle, Nicholas J. DeCarbo, 
School of Music, University of Miami 



Abstract 

The purpose of the study was to (a) examine the interjudge 
reliability for five groups of judges on seven rating categories 
on a band/orchestra adjudication form and (b) determine the 
extent to which category ratings are interrelated. Specific 
questions addressed in the study were: 

1. What are the interjudge reliability coefficients for each 
category and for the final ratings? 

2. What are the correlation coefficients between each of 
the performance category ratings and the final ratings 
for individual judges and for the combined judges? 

3. To what extent do ratings for the performance 
categories influence the final ratings? 



4. Are there differences in judges' mean ratings for each 
of the performance categories and for the final ratings? 

Interjudge reliability coefficients for three sets of judges 
were marginally acceptable (in the .80s); those for the other 
two sets of judges (.67 and .54) were not. Interjudge reliability 
coefficients for the various category ratings were generally 
much lower than those for the final ratings. 



Two performance categories (technique and intonation) 
were the best predictors of final ratings. The categories 
"selection" and "general effect" contributed nothing toward 
predicting the final ratings. 



Band, orchestra, and choir festival evaluations are a regular part 
of many secondary school music programs, and most such festivals 
engage adjudicators who rate each group's performance. Whether 
such festivals are actually competitions may be a matter of 
perspective. If a festival is structured so that adjudicators rate a 
performance in relation to the performances of other groups, say for a 
first place award, perhaps the festival is a competition. On the 
other hand, if each group is evaluated in relation to some fixed 
standard irrespective of the evaluations of other groups 
participating in the festival, perhaps it should not be considered a 
competition. Of course there is always the question whether adjudicators 
can totally isolate a group's performance from that of others. 
Despite efforts to keep the focus on providing feedback for 
improving the performance of the participating ensembles, perhaps 
there will always be an element of competition in secondary school 
music festivals. 

Because music ensemble performance is complex and multi- 
dimensional, it does not lend itself readily to precise measurement; 
generally, musical performances are evaluated subjectively, that is, 
reflecting either consciously or subconsciously the criteria that an 
individual evaluator considers most important. Allowing in- 
dividual adjudicators to employ their own criteria in evaluating 
performance festivals, however, presents some potential 
problems. To help alleviate these problems, most performance 
festivals do two things: (a) employ more than one adjudicator and 
(b) ask adjudicators to consider a common set of performance 
categories in arriving at a final rating. 

Most adjudication forms include about six performance 
categories or "standards" against which adjudicators are asked to 
provide ratings. Typical performance categories are tone, intona- 
tion, technique, balance, interpretation, musical effect, and "other 
factors." Adjudicators usually are provided with descriptions of the 
various standards and then asked to rate each performance against 
the standard. Most festivals require adjudicators to employ a 
five-point, or five-category, rating scale. The five rating categories 
usually are designated by Roman numerals, I through V, and they 
may be given parallel verbal labels such as superior (I), excellent 
(II), good (III), fair (IV), and poor or unacceptable (V). Others, as 
in the case of the present study, may use letter ratings. 

An assumption of this procedure is that each adjudicator's 
ratings of a performance relative to the performance categories or 
standards will more-or-less provide the basis for the respective 
adjudicator's rating of the performance, which in turn is averaged 
with those of two other adjudicators to determine a final rating for 
each performance. Whether ratings for the respective perfor- 
mance categories indeed provide the real basis for overall ratings 
is questionable. Furthermore, research on the effects of the vari- 
ables in the adjudication process is inconclusive. 

Related Literature 

Several studies have examined the relationship between certain 
judge characteristics and evaluation ability as measured by the 
reliability of individual judges. Other measures examined in some 
of these studies are the internal consistency of evaluation forms 
and agreement among judges. Other studies have tested various 
procedures for their effectiveness in increasing judge reliability. 

Vasil (1973) and Massel (1978) examined the differences in 
judges' rank orderings between tape recorded versions of 
performances and live (Vasil) or videotaped (Massel) versions. The 
difference was found to be minimal for the top fifteen of 
thirty-three performances (Vasil) and all of twenty-two performances 
(Massel). 

Fiske (1977) investigated the relationships among reliability of 
music performance adjudication, judge performance ability, and 
judge nonperformance music achievement. Thirty-three recent 
music education graduates rated a series of tape recorded trumpet 
performances twice. Results showed no relationship between 
performing ability and judge reliability or between performing 
ability and judge nonperformance music achievement. There was 
a significant inverse relationship between judge reliability and 
nonperformance music achievement. 

Fiske (1978) investigated the use of training sessions to increase 
judge reliability. Although no significant effects were found, Fiske 
suggests that training sessions may prove effective with 
instructional revisions and the use of stronger motivating factors. 

Mullin (1979) examined the relationship between musical 
quantitative decision-making ability as measured by subtests of aptitude 
tests by Seashore (Seashore, Lewis, and Saetveit, 1939, 1960) and 
Gordon (1965) and qualitative decision-making ability as 
measured by investigator-designed reliability measures. Results 
showed no relationship between correct response to aurally 
presented test items and evaluation reliability. It was concluded 
that two different independently functioning problem-solving 
strategies are present in the evaluation process. 

Towers (1980) investigated the effect of judge age and musical 
experience on reliability. For a set of judges ranging in age from 
seven years to "adult," there was a significant trend towards in- 
creased improvement in reliability with increasing age. Musical 
experience also had a significant effect on judge reliability. 

The purpose of a study by Burnsed, Kinkle, and King (1985) 
was to determine (a) the internal consistency of a performance 
evaluation form, (b) the interjudge reliability of groups of judges, 
and (c) the significance of performance rating categories as 
predictors of final ratings at selected adjudication festivals. The reliability 
of the adjudication form used was found to be very high. Out of 
four groups of judges, agreement was low for three groups on 
ratings of tone and for two groups on ratings of intonation and 
balance. A significant collinear relationship was found between all 
seven ratings (tone, intonation, technique, balance, interpretation, 
musical effect, and final). Of the six performance categories, 
musical effect correlated highest with the final rating. It was 
concluded that judges tend to evaluate performances from a global 
perspective and that performance categories may not represent 
separate entities. 

Burnsed and King (1987) continued the 1985 investigation, 
adding data from additional ensembles and groups of judges to 
their previous data. Interjudge agreement in all nine groups was 
again high for final ratings, but low in six groups on ratings of tone 
and in five groups on ratings of intonation. A correlation of all 
performance ratings revealed that performance category ratings 
and final ratings were so closely related as to represent a global 
rating. Again, musical effect correlated highest with the final rating. 
As with the 1985 study, the investigators concluded that certain 
category ratings, including tone, intonation, and balance, may be 
viewed with some skepticism, since judges appear to base their 
ratings on a single, global evaluation. 

Purpose 

The purpose of the present study was to (a) examine the 
interjudge reliability for five groups of judges regarding seven 
rating categories found on a band/orchestra adjudication form 
published by the Music Educators National Conference, and (b) 
determine the extent to which category ratings are interrelated. 
Specific questions addressed in the study were: 

1. What are the interjudge reliability coefficients for each 
category and for the final ratings? 

2. What are the correlation coefficients between each of the 
performance category ratings and the final ratings for in- 
dividual judges and for the combined judges? 

3. To what extent do ratings for the performance categories 
influence the final rating for combined judges? 



4. Are there differences in judges' mean ratings for each of 
the performance categories and for the final ratings? 

Methodology 

Data for the study are based on the adjudication results for the 
Dade County Orchestra Festivals held during the 1980s. Forms for 
1980 through 1990 were originally examined, but substantial dif- 
ferences in the forms for 1980 and 1981 precluded the use of those 
data in the study; also, complete data for all three adjudicators were 
not available for 1984, 1985, and 1988. Consequently, the present 
study is based on data for the 1983, 1986, 1987, 1989, and 1990 
Dade County Orchestra Festivals. Ratings were available for 13 
orchestras in 1983, 13 in 1986, 15 in 1987, 18 in 1989, and 13 in 
1990. Although the ratings were originally given on a five-letter 
scale, they were converted to a five-number scale for statistical 
analysis, i.e. A (Superior), B (Excellent), C (Good), D (Fair) and E 
(Poor) ratings were treated numerically as 4, 3, 2, 1, and 0 respec- 
tively. 
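
The conversion described above can be sketched in a few lines of Python. Only the letter-to-number mapping comes from the text; the names LETTER_TO_SCORE and to_scores are illustrative assumptions, not part of the study materials.

```python
# Mapping stated in the text: A (Superior) = 4 ... E (Poor) = 0.
LETTER_TO_SCORE = {"A": 4, "B": 3, "C": 2, "D": 1, "E": 0}


def to_scores(letters):
    """Convert a sequence of letter ratings to numeric scores.

    Whitespace and letter case are normalized first; an unknown
    letter raises KeyError rather than being silently skipped.
    """
    return [LETTER_TO_SCORE[letter.strip().upper()] for letter in letters]


# Hypothetical ratings for four orchestras from one judge.
print(to_scores(["A", "b", "C ", "E"]))  # [4, 3, 2, 0]
```

Treating the letters as equally spaced integers is what allows the Pearson correlations and mean comparisons reported later to be computed.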

Four basic statistical analyses were conducted on each year's 
results. Pearson product-moment correlations between the 
independent judges' final ratings were used to provide a measure of 
interjudge reliability. Interjudge reliabilities were determined for 
all category ratings and for the final ratings for each year's judges 
by calculating Pearson product-moment correlations between 
each pair of judges and averaging the three coefficients. 
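
As a minimal sketch of this procedure, assuming each judge's ratings are held in a list with one numeric score per orchestra, the reliability coefficient is the mean of the pairwise Pearson correlations. The function names and the example ratings below are hypothetical and are not taken from the festival data.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation between two equal-length lists.

    Raises ZeroDivisionError if either list has zero variance, for
    example when a judge gives every orchestra the same rating.
    """
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def interjudge_reliability(judges):
    """Average the Pearson r over every pair of judges."""
    pairs = list(combinations(judges, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)


# Hypothetical final ratings for five orchestras from three judges.
judge_1 = [4, 3, 2, 1, 3]
judge_2 = [4, 3, 2, 2, 3]
judge_3 = [3, 3, 2, 1, 4]
print(round(interjudge_reliability([judge_1, judge_2, judge_3]), 2))
```

With three judges there are three pairs, which matches the "three coefficients" averaged in the text. Note that the coefficient is undefined when a judge makes no distinctions among the orchestras, a situation the Discussion reports for the general effect category.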

Pearson product-moment correlations also were calculated be- 
tween each judge's ratings of the respective performance 
categories and the final rating. Thus, each judge's ratings and the 
mean of the three judges' ratings for the performance categories of 
tone, intonation, technique, selection, interpretation, and general 
effect were correlated with the final overall rating. 

A stepwise multiple regression analysis was calculated to 
ascertain the extent to which the combined judges' ratings for the 
various performance categories "accounted for" or predicted the 
final rating. The six performance categories were the independent 
variables and the final overall rating was the dependent variable. 
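
The general idea can be illustrated with a simplified forward-selection sketch. It is not the procedure used in the study: classical stepwise regression enters and removes variables using F tests, whereas this sketch, as a labeled assumption, simply adds the predictor giving the largest gain in R² and stops when the gain falls below a hypothetical threshold (min_gain). All names and data are illustrative.

```python
def r_squared(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Augmented normal equations: (X'X | X'y).
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    # Perfectly collinear predictors make the system singular (division by zero).
    for c in range(k):
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        for r in range(k):
            if r != c:
                factor = A[r][c] / A[c][c]
                A[r] = [a - factor * b for a, b in zip(A[r], A[c])]
    beta = [A[i][k] / A[i][i] for i in range(k)]
    fitted = [sum(b * x for b, x in zip(beta, row)) for row in X]
    y_mean = sum(y) / n
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_selection(predictors, y, min_gain=0.01):
    """Return [(name, cumulative R^2), ...] in order of entry."""
    chosen, steps, best = [], [], 0.0
    remaining = dict(predictors)
    while remaining:
        scores = {name: r_squared([predictors[c] for c in chosen] + [col], y)
                  for name, col in remaining.items()}
        name = max(scores, key=scores.get)
        if scores[name] - best < min_gain:
            break
        best = scores[name]
        chosen.append(name)
        steps.append((name, best))
        del remaining[name]
    return steps


# Hypothetical combined-judge ratings for five orchestras.
categories = {"technique": [1, 2, 3, 4, 5], "selection": [2, 1, 2, 1, 2]}
final = [2, 4, 6, 8, 10]
print(forward_selection(categories, final))
```

In this toy example the "technique" column predicts the final rating perfectly, so it enters first and "selection" adds nothing, which mirrors the kind of pattern the study looks for in the real ratings.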

A repeated-measures ANOVA was used to compare the judges' 
mean ratings for each of the performance categories and for the 
final ratings. 
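
A one-way repeated-measures ANOVA of this kind treats each orchestra as a "subject" rated under three conditions (the three judges). The sketch below computes the F ratio and its degrees of freedom under that assumption; the function name and toy data are hypothetical, and the p value is omitted because the F distribution is not in the Python standard library.

```python
def repeated_measures_f(ratings):
    """F test for differences among judges' mean ratings.

    ratings: one row per orchestra, one column per judge.
    Returns (F, df_judges, df_error).
    """
    n = len(ratings)        # orchestras (subjects)
    k = len(ratings[0])     # judges (repeated conditions)
    grand = sum(sum(row) for row in ratings) / (n * k)
    judge_means = [sum(row[j] for row in ratings) / n for j in range(k)]
    orch_means = [sum(row) / k for row in ratings]

    ss_judges = n * sum((m - grand) ** 2 for m in judge_means)
    ss_orch = k * sum((m - grand) ** 2 for m in orch_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    # Residual variation after removing judge and orchestra effects.
    ss_error = ss_total - ss_judges - ss_orch

    df_judges, df_error = k - 1, (k - 1) * (n - 1)
    f_ratio = (ss_judges / df_judges) / (ss_error / df_error)
    return f_ratio, df_judges, df_error


# Hypothetical ratings: four orchestras (rows) by three judges (columns).
toy = [[1, 2, 3], [2, 4, 3], [3, 3, 6], [2, 3, 4]]
f_ratio, df1, df2 = repeated_measures_f(toy)
print(round(f_ratio, 2), df1, df2)  # F is about 6.0 with df (2, 6)
```

For a year with 13 orchestras and three judges, the same formula gives degrees of freedom of 2 and 24.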

Essentially, the data from these analyses provided information 
regarding the extent to which the three judges differed in their 
overall ratings of the orchestra performances for the six perfor- 
mance categories and for the final rating. 

Results 

The results are presented as they pertain to the four basic 
questions of the study. 

Interjudge Reliability 

As shown in Table 1, the interjudge reliability coefficients for 
the final ratings ranged from a low of .54 in 1989 to a high of .89 
in 1987. Generally, reliability coefficients for the final ratings were 
higher than for the various categories. 

Although not tested statistically, considerable year-to-year dif- 
ferences were apparent in the interjudge reliability coefficients for 
all categories and the final ratings. The year-to-year disparities in 
the interjudge reliability coefficients were greatest for the 
categories tone, general effect, and intonation. For tone, intona- 
tion, technique, interpretation, and final ratings, the interjudge 
reliability coefficients for 1989 were much lower than for other 
years. In 1986 and 1987, however, the interjudge reliabilities for 
general effect were very low (.07 and .27, respectively). 

Correlations Between Categories and Final Ratings 

With few exceptions, correlations between the individual 
judges' ratings and the final rating and between the combined 
judges' ratings and the final rating were statistically significant. 
With two exceptions, all of the correlation coefficients between 
judges' ratings for tone and the final rating were above .70, with 
most in the .80s. All but four correlations between intonation and 
the final rating were above .70, for technique all but three, and for 
interpretation all but three. For selection, however, the correla- 
tions with final ratings were generally lower, with ten of the 20 
correlation coefficients below .70. The correlations for general 
effect were both lower and more disparate. (See Table 2.) 

Prediction of Final Ratings from Category Ratings 

The multiple regression analyses sought to determine the 
extent to which combined judges' category ratings predicted the final 
ratings. With separate analyses conducted for each year's ratings, 
technique proved to be the best predictor for all years. (See Table 
3.) Generally, the addition of other category ratings did not 
increase the multiple correlation coefficient greatly. The greatest 
increases in the adjusted R² values after all variables were added 
in the regression analysis were for 1986 (from .69 to .86) and for 
1987 (from .74 to .89); the least change was for 1983 (from .87 to 
.93). 



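[Table 3, partial: regression summary; predictor labels and the 
heading of the first column are not recoverable from the source] 

(coefficient)    F         df      p 
.80              151.00    1,37    .001 
.87              124.77    2,36    .001 
.89              102.20    3,35    .001 
.90               85.18    4,34    .001 
.90               69.51    6,33    .001 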



                              Mean Ratings 
Category/      No. of 
Year           Orch.   Judge 1       Judge 2       Judge 3       F       p 

Tone 
1983           13      [illegible]   [illegible]   [illegible]   2.37    .11 
1986           13      2.52          3.08          [illegible]   1.82    .18 
1987           15      2.72          2.80          2.56          0.53    .59 
1989           18      3.01          3.40          [illegible]   1.73    .19 
1990           13      2.75          2.85          3.03          0.32    .77 

Intonation 
1983           13      1.66          2.48          2.63          4.58    .02* 
1986           13      2.47          2.82          2.17          1.92    .15 
1987           15      2.38          2.51          2.31          0.29    .75 
1989           18      2.20          2.78          3.04          4.56    .02* 
1990           13      2.59          2.69          2.34          0.33    .72 

Technique 
1983           13      2.36          3.03          2.83          1.93    .16 
1986           13      2.57          3.31          2.72          3.18    .05(4) 
1987           15      2.63          2.53          2.74          0.37    .69 
1989           18      3.23          3.16          3.15          0.05    .95 
1990           13      2.52          2.77          2.62          0.20    .82 

Selection 
1983           13      3.39          3.37          3.39          0.00    .99 
1986           13      2.77          3.46          3.08          2.73    .08 
1987           15      3.55          2.95          3.20          2.64    .08 
1989           18      3.44          3.68          3.61          0.43    .65 
1990           13      3.85          3.54          3.34          2.68    .08 

Interpretation 
1983           13      2.71          3.03          3.03          0.68    .51 
1986           13      3.05          3.08          2.77          0.85    .44 
1987           15      3.23          2.69          2.92          2.38    .11 
1989           18      3.28          3.17          3.33          0.27    .77 
1990           13      3.00          2.92          2.69          0.33    .72 

General Effect 
1983           13      2.69          3.18          3.32          2.56    .09 
1986           13      3.31          3.46          4.00          5.15    .01* 
1987           15      2.97          3.15          4.00          10.60   .01* 
1989           18      3.43          3.72          3.37          1.30    .28 
1990           13      3.15          3.08          3.13          0.02    .98 

Final 
1983           13      2.43          3.00          2.92          1.58    .22 
1986           13      2.69          2.85          2.85          0.17    .85 
1987           15      2.87          2.69          2.93          0.42    .66 
1989           18      3.32          3.08          3.44          1.19    .31 
1990           13      2.77          2.92          2.85          0.08    .92 






No. 2— Fall 1991 



RESEARCH PERSPECTIVES 



The greatest increases were after the addition of the second 
variable, which for 1986, 1987, and 1989 was interpretation. Al- 
though there were again year-to-year differences, other variables 
seemed to add little to the predictive strength of technique and 
interpretation. Selection ratings added virtually no predictive 
value for any of the five years. 

Comparison of Judges' Mean Ratings 

Whereas the interjudge reliability data reflected relationships 
between judges' ratings, the repeated-measures ANOVA was used 
to compare the judges' mean ratings for each category and their 
respective final ratings. As shown in Table 4, there were no 
statistically significant differences among the judges' mean ratings 
during any year for tone, technique, selection, interpretation, or 
final rating. The only statistically significant differences in the 
judges' mean ratings were for intonation in 1983 and 1989 and for 
general effect in 1986 and 1987. 

Discussion 

As apparent from the data, the interjudge reliabilities for both 
the final ratings and the category ratings were unsatisfactory. 
Certainly, interjudge reliability coefficients for final ratings should 
be much higher than those observed in the present study. The 
reliability coefficients for 1986, 1987, and 1990, which were in the 
.80s, were marginally acceptable, but the reliability coefficients for 
1983 (.67) and 1989 (.54) were clearly unacceptable. While it is 
expected that judges will vary some in the standards they apply in 
adjudication festivals, and hence the level of ratings, there should 
be a consistency of rating from orchestra to orchestra. 

Interjudge reliability coefficients for the various category ratings 
were even lower than those for the final ratings; they also reflected 
greater year-to-year variation than the final ratings. Reliability 
coefficients for general effect were the lowest, suggesting that 
individual judges have differing interpretations of general effect. A 
further examination of the data revealed that for both 1986 and 1987 at 
least one judge failed to make any distinctions among the 
orchestras for the general effect category. Why no distinctions were 
made is unclear, but the verbal descriptors on the adjudication form 
may have been a factor. The two verbal descriptors for the general 
effect category were "stage presence" and "artistry," which are 
seemingly incongruous. General effect apparently has various meanings 
to different judges and may be redundant with the other categories and 
their descriptors. The descriptors "stage presence" and "artistry" 
provide little guidance to the adjudicators. 

Other categories reflecting very low (less than .40) interjudge 
reliability coefficients for at least one year were tone and 
intonation. Further research revealed that the judges for the years with 
the very low reliabilities included relatively inexperienced 
adjudicators who also had different musical training and experience. 
One was an experienced public school strings/orchestra teacher, 
one was a performer and applied music teacher, and the other was 
a composer/theorist who had some experience as a youth 
orchestra conductor. Perhaps their varied professional backgrounds 
were contributing factors to the way they judged tone and 
intonation of the orchestras. For the other years, however, interjudge 
reliability coefficients for tone and intonation were generally in the 
.70s, which is considerably better, but still less than desired. 

While not reflected in the data, another factor may have had 
some bearing on the ratings. This factor, which was learned from 
talking with some of the adjudicators, concerned information 
extraneous to the musical performance which had been conveyed 
to judges. Apparently judges were informed that some orchestra 
programs were in early stages of development, and perhaps the 
judges varied in the extent to which they sought to be "encourag- 
ing" or "accommodating" in their ratings of these programs. 

Judges' individual and combined ratings for the tone, intona- 
tion, technique, and interpretation categories tended to correlate 
well with the final rating, while correlations for the selection and 
general effect categories were much lower. The lower correlations 
undoubtedly are a partial reflection of the inconsistencies among 
the judges' ratings for the general effect and selection categories, 
but they may also be reflections of the different nature of these two 
categories. The problems with the descriptors for the general 
effect category were noted above, but the fact that both the 
selection and the general effect categories take into consideration 
variables other than performance also may contribute to the low 
correlations. Perhaps the profession should re-examine the need 
for these categories on adjudication forms; at the very least, the 
descriptors for the categories need re-thinking. 

The data from the regression analysis are an extension of the 
correlational data between the categories and the final ratings. For 
three of the five years examined, the four performance categories 
(technique, intonation, interpretation, and tone) accounted for 
most of the predictive value. Selection and general effect con- 
tributed virtually nothing toward predicting the final ratings. 

Data from the comparison of the judges' mean ratings of the 
orchestras for the respective categories, however, are somewhat 
more encouraging. The judges' mean ratings were not significantly 
different in any year for tone, technique, interpretation, 
selection, or the final ratings. 
Apparently, judges are in general agreement with respect to the 
levels at which they rate the groups overall. 



Implications 

1. The profession needs to re-examine some of the categories, 
especially general effect and selection, on such adjudication 
forms. There is a particular need to re-think the descriptors 
for the various categories and to include descriptors that will 
have a common meaning for all adjudicators. 

2. Guidelines for adjudicators need to provide more and better 
information regarding the use of the categories in arriving at 
a final rating. Descriptors for the various categories should 
be well defined. 

3. Some type of adjudicator orientation should be developed to 
ensure that adjudicators have a common understanding of 
the terms, the categories, and their use in arriving at the final 
ratings. 

4. Festival managers should be careful to avoid any comments 
either prior to or during the evaluation festival that might 
inadvertently bias judges toward leniency in their ratings of 
orchestras from programs that are in early stages of 
development. Evaluation festivals should provide as 
accurate and objective ratings as possible. 